Bootstrapping Distributional Feature Vector Quality
Authors
Abstract
This article presents a novel bootstrapping approach for improving the quality of feature vector weighting in distributional word similarity. The method was motivated by attempts to utilize distributional similarity for identifying the concrete semantic relationship of lexical entailment. Our analysis revealed that a major reason for the rather loose semantic similarity obtained by distributional similarity methods is insufficient quality of the word feature vectors, caused by deficient feature weighting. This observation led to the definition of a bootstrapping scheme which yields improved feature weights, and hence higher quality feature vectors. The underlying idea of our approach is that features which are common to similar words are also most characteristic for their meanings, and thus should be promoted. This idea is realized via a bootstrapping step applied to an initial standard approximation of the similarity space. The superior performance of the bootstrapping method was assessed in two different experiments, one based on direct human gold-standard annotation and the other based on an automatically created disambiguation dataset. These results are further supported by applying a novel quantitative measurement of the quality of feature weighting functions. Improved feature weighting also allows massive feature reduction, which indicates that the most characteristic features for a word are indeed concentrated at the top ranks of its vector. Finally, experiments with three prominent similarity measures and two feature weighting functions showed that the bootstrapping scheme is robust and is independent of the original functions over which it is applied.
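The core idea, promoting features that recur among a word's distributionally nearest neighbours and then pruning the rest, can be sketched roughly as follows. This is a minimal illustration assuming sparse dict-based vectors, cosine similarity, and a similarity-weighted vote for feature promotion; the function names (bootstrap_reweight, nearest_neighbours) and the specific promotion formula are illustrative assumptions, not the paper's exact weighting scheme.

```python
# Hedged sketch of the bootstrapping step described in the abstract:
# features shared with a word's distributionally similar neighbours are
# promoted, on the assumption that they are its most characteristic ones.
from collections import defaultdict
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts: feature -> weight)."""
    shared = set(u) & set(v)
    dot = sum(u[f] * v[f] for f in shared)
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def nearest_neighbours(vectors, word, k):
    """Top-k most similar words to `word` under the current feature weighting."""
    sims = [(other, cosine(vectors[word], vec))
            for other, vec in vectors.items() if other != word]
    sims.sort(key=lambda x: x[1], reverse=True)
    return sims[:k]


def bootstrap_reweight(vectors, k=10, keep=100):
    """One bootstrapping pass over an initial similarity space: promote features
    that recur among a word's nearest neighbours, then keep only the `keep`
    top-ranked features (feature reduction)."""
    new_vectors = {}
    for word, vec in vectors.items():
        neighbours = nearest_neighbours(vectors, word, k)
        # For every feature of `word`, accumulate similarity-weighted support
        # from neighbours that share it (the weighting choice is an assumption).
        support = defaultdict(float)
        for other, sim in neighbours:
            for f in vectors[other]:
                if f in vec:
                    support[f] += sim
        # Promote shared features; features with no neighbour support keep
        # their original weight unchanged.
        reweighted = {f: w * (1.0 + support[f]) for f, w in vec.items()}
        # Massive feature reduction: retain only the highest-weighted features.
        top = sorted(reweighted.items(), key=lambda x: x[1], reverse=True)[:keep]
        new_vectors[word] = dict(top)
    return new_vectors
```

The sketch performs a single pass over an initial approximation of the similarity space, in line with the abstract's description of one bootstrapping step; the initial weights themselves would come from a standard weighting function such as PMI, which is outside the scope of this illustration.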
Similar resources
Feature Vector Quality and Distributional Similarity
We suggest a new goal and evaluation criterion for word similarity measures. The new criterion, meaning-entailing substitutability, fits the needs of semantic-oriented NLP applications and can be evaluated directly (independently of an application) at a good level of human agreement. Motivated by this semantic criterion, we analyze the empirical quality of distributional word feature vectors and its...
Trained Named Entity Recognition using Distributional Clusters
This work applies boosted wrapper induction (BWI), a machine learning algorithm for information extraction from semi-structured documents, to the problem of named entity recognition. The default feature set of BWI is augmented with features based on distributional term clusters induced from a large unlabeled text corpus. Using no traditional linguistic resources, such as syntactic tags or speci...
Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests
This work develops formal statistical inference procedures for predictions generated by supervised learning ensembles. Ensemble methods based on bootstrapping, such as bagging and random forests, have improved the predictive accuracy of individual trees, but fail to provide a framework in which distributional results can be easily determined. Instead of aggregating full bootstrap samples, we co...
On the effect of synthetic morphological feature vectors on hyperspectral image classification performance
This paper studies the effect of synthetic feature vectors on the classification performance of hyperspectral remote sensing images. Morphological attribute profiles, which have proven themselves in this field, are employed as feature vectors. At this early stage of our work, the relatively simple Bootstrapping algorithm has been used for synthetic feature vector generation. Base...
Publication date: 2009